Add Bamba Model #10909
Conversation
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Hi @fabianlim, thanks for the PR! It's really great to see progress being made on state-space models, especially for me as I unfortunately haven't been able to prioritize support for Mamba2. I'm happy to shepherd this PR and discuss any questions you have, especially to support chunked prefill. If you haven't already, can you join the developer slack for quicker discussion? (https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack)
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@tlrmchlsmth I cleaned up the PR quite a bit, so perhaps it might be a good time to get some early eyes on it. The chunked prefill implementation is incomplete ATM, as we discussed offline.
first pass, just a few comments. At a high level it looks good.
Will you add a test for tensor parallelism?
# will be ch
MODELS = ["ibm-fms/Bamba-9.8b-1.8T-hf"]
The comment trails off, but will there be a small test model available?
@raghukiran1224 any plans for a small test model? Since we compare outputs, I think it is not that good to just have a randomly initialised small model.
@fabianlim @tlrmchlsmth would it be ok to test with a random model or would you rather have a tiny model (say 200M or so) to test with?
A tiny model with nonrandom weights would be much better!
btw is there any update on this?
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
@tlrmchlsmth I have addressed most of your comments now. I am not rebasing yet, waiting for you to look first. But I realized
@fabianlim At a high level, the changes look good, and the PR looks good overall. I'll do a more thorough review once it's unmarked as draft.
Could you add unit tests for the added kernels in layers/mamba/ops?
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
lora_config = vllm_config.lora_config

self.config = config
self.padding_idx = config.pad_token_id
is this one used anywhere?
Oh sorry, good catch. I will remove it.
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Note to self: some of the testing API has changed due to PR #10353.
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Looking pretty good, let's get this landed now that 4.48.2 is out!
vllm/model_executor/models/mamba2.py
Let's land mamba 2 in #9292
requirements-common.txt
Could you merge in latest main? We've already landed this change
Ya, actually yesterday I reverted this file and took it from latest main, but somehow the diff still shows up in GitHub. The version on the left shown by GitHub is actually old.
OK, if I merge in latest main it seems fine.
requirements-test.txt
same as above, the version on the left is old. the right is from latest main
Let's land these changes as part of #9292
# Adapted from transformers.models.mamba2.modeling_mamba2.MambaRMSNormGated
@CustomOp.register("mixer2_gated_rms_norm")
class Mixer2RMSNormGated(CustomOp):

    def __init__(self, full_hidden_size, full_n_groups, eps=1e-6):
        super().__init__()
        self.tp_size = get_tensor_model_parallel_world_size()
        self.tp_rank = get_tensor_model_parallel_rank()
        self.full_hidden_size = full_hidden_size
        self.group_size = full_hidden_size // full_n_groups
        self.per_rank_hidden_size = full_hidden_size // self.tp_size
        self.n_groups = full_hidden_size // self.group_size

        self.variance_epsilon = eps
        self.weight = nn.Parameter(torch.ones(self.per_rank_hidden_size))
        set_weight_attrs(self.weight,
                         {"weight_loader": sharded_weight_loader(0)})
        assert self.full_hidden_size % self.tp_size == 0, \
            "Tensor parallel world size must divide hidden size."

    def forward_native(
        self,
        x: torch.Tensor,
        gate: torch.Tensor,
    ):
        # Three tensor-parallel cases:
        #   1. n_groups is 1
        #      In this case we parallelize along the reduction dim.
        #      Each rank computes a local sum of squares followed by AllReduce.
        #   2. tp_size divides n_groups
        #      Each rank only reduces within its local group(s).
        #      No collective ops necessary.
        #   3. The general case can be pretty complicated, so we AllGather
        #      the input and then redundantly compute the RMSNorm.
        input_dtype = x.dtype
        x = x * nn.functional.silu(gate.to(torch.float32))

        if self.n_groups == 1:
            if self.tp_size > 1:
                # Compute local sum and then reduce to obtain global sum
                local_sums = x.pow(2).sum(dim=-1, keepdim=True)
                global_sums = tensor_model_parallel_all_reduce(local_sums)
                # Calculate the variance
                count = self.tp_size * x.shape[-1]
                variance = (global_sums / count)
            else:
                variance = x.pow(2).mean(-1, keepdim=True)
            x = x * torch.rsqrt(variance + self.variance_epsilon)
        else:
            redundant_tp: bool = self.n_groups % self.tp_size != 0
            if redundant_tp:
                # To handle the general case, redundantly apply the variance
                x = tensor_model_parallel_all_gather(x, -1)

            *prefix_dims, hidden_dim = x.shape
            group_count = hidden_dim // self.group_size
            x_grouped = x.view(*prefix_dims, group_count, self.group_size)
            variance = x_grouped.pow(2).mean(-1, keepdim=True)
            x_grouped = x_grouped * torch.rsqrt(variance +
                                                self.variance_epsilon)
            x = x_grouped.view(*prefix_dims, hidden_dim)

            if redundant_tp:
                start = self.per_rank_hidden_size * self.tp_rank
                end = start + self.per_rank_hidden_size
                x = x[..., start:end]

        return self.weight * x.to(input_dtype)

    def forward_cuda(
        self,
        x: torch.Tensor,
        gate: torch.Tensor,
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:

        if self.tp_size > 1 or self.n_groups != 1:
            return self.forward_native(x, gate)

        from vllm import _custom_ops as ops

        # cast x and gate to float32 before silu
        out = torch.empty_like(x)
        y = x * nn.functional.silu(gate.to(torch.float32))
        ops.rms_norm(
            out,
            y.to(x.dtype),
            self.weight.data,
            self.variance_epsilon,
        )
        return out
We should get a unit test in place for this, especially the various tensor parallel cases. @fabianlim do you have bandwidth to do that? Otherwise I can do it either in #9292 or a separate PR. I do feel pretty good about correctness here, having manually tested various cases thoroughly enough.
Just unit testing only the Mixer2RMSNormGated? If so, how would you set up the test? conftest only has runners for the whole model.
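For what it's worth, here is a rough standalone sketch of how the math in Mixer2RMSNormGated could be unit-tested without the whole-model runners in conftest. The function and test names below are illustrative, not from this PR, and the sketch only covers the single-rank math; comparing against forward_native under real tensor parallelism would additionally require initializing vLLM's distributed state.

import torch
import torch.nn.functional as F


def ref_gated_grouped_rms_norm(x, gate, weight, group_size, eps=1e-6):
    # Illustrative plain-PyTorch reference for the gated, grouped RMSNorm:
    # y = x * SiLU(gate) in float32, then RMSNorm applied per group of channels.
    input_dtype = x.dtype
    y = x * F.silu(gate.to(torch.float32))
    *prefix, hidden = y.shape
    y = y.view(*prefix, hidden // group_size, group_size)
    variance = y.pow(2).mean(-1, keepdim=True)
    y = (y * torch.rsqrt(variance + eps)).view(*prefix, hidden)
    return weight * y.to(input_dtype)


def test_single_group_reduces_to_plain_rms_norm():
    torch.manual_seed(0)
    hidden = 64
    x = torch.randn(2, 8, hidden)
    gate = torch.randn_like(x)
    weight = torch.ones(hidden)

    # With a single group spanning the full hidden dim, the grouped norm
    # should match an ordinary RMSNorm over the whole hidden dim.
    out = ref_gated_grouped_rms_norm(x, gate, weight, group_size=hidden)

    y = x * F.silu(gate)
    expected = weight * (y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + 1e-6))
    torch.testing.assert_close(out, expected, rtol=1e-5, atol=1e-5)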
@@ -69,6 +70,7 @@
    "LLaMAForCausalLM": ("llama", "LlamaForCausalLM"),
    "MambaForCausalLM": ("mamba", "MambaForCausalLM"),
    "FalconMambaForCausalLM": ("mamba", "MambaForCausalLM"),
    "Mamba2ForCausalLM": ("mamba2", "Mamba2ForCausalLM"),
Let's land this in #9292
To debug the pre-commit issue locally you may need to run:
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@tlrmchlsmth thank you so much for your comments. I have fixed the
This is the companion PR to a Hugging Face PR for adding Bamba, a hybrid Mamba2 architecture with SwiGLU. The checkpoints are jointly trained by IBM, Princeton, and UIUC. In this PR we have:
- The bamba model inference architecture. We would like to acknowledge the jamba team, whose implementation we referenced and modified to support full attention layers with RoPE and Mamba v2. An earlier limitation, where the continuous-batching boundaries had to line up with the chunk boundaries, is now completely fixed.
- Kernels in vllm/model_executor/layers/mamba/ops. Only the fwd kernels are extracted; some modifications and fixes are made.
- tests/models/decoder_only/language/test_bamba.py with an initial ibm-fms/Bamba-9.8b-1.8T-hf checkpoint. It is practically identical to test_mamba.py, except that chunked prefill tests are disabled as chunked prefill is currently not supported. (A minimal usage sketch follows after this list.)

Currently only the FlashAttention backend is supported, as we check fields like context_lens_tensor; we have not yet investigated other backends.

We would like to also acknowledge the draft Codestral Mamba PR from @tlrmchlsmth, from which we also referenced the mixer.
Hope to discuss the following with the maintainers:
- Do we have to remove all the bwd kernels? Yes, we should.
- We extend the sin_cos cache to cover the sequence length if it is longer than max_sequence_len. This differs from other current models (e.g., llama). How can we better support long sequence lengths? We should keep this consistent with other models, so we propose to allow the sin_cos cache extension only when VLLM_ALLOW_LONG_MAX_MODEL_LEN is specified (see the sketch after this list).
- We have some ideas to support chunked prefill, but would appreciate some discussion with the maintainers on how to proceed; we are now working on changing the kernels to support chunked prefill.
- Since the mixer2 is simplified from mamba, should we rename it? We can keep it as is, but we should document the differences from mamba_ssm.
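A rough sketch of the sin_cos gating proposed above (illustrative only: maybe_extend_sin_cos_cache and the extend_cache hook are placeholder names, not the rotary-embedding code in this PR):

import os


def maybe_extend_sin_cos_cache(rotary_emb, seq_len: int, max_model_len: int):
    # Placeholder sketch: only grow the rope sin/cos cache past max_model_len
    # when the user explicitly opts in via VLLM_ALLOW_LONG_MAX_MODEL_LEN.
    if seq_len <= max_model_len:
        return
    if os.getenv("VLLM_ALLOW_LONG_MAX_MODEL_LEN", "0") != "1":
        raise ValueError(
            f"Sequence length {seq_len} exceeds max_model_len {max_model_len}; "
            "set VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 to allow extending the cache.")
    rotary_emb.extend_cache(seq_len)  # hypothetical hook on the rope module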
cc: @ani300, @raghukiran1224, @cyang49, @njhill